Before you start

My journal entry of Assignment 3 can be found here.

Brief Introduction

The data is from GSE178521 (ZHENG et al., 2022). This data set have 6 samples, 3 replicates for the control with GFP knockdown (GFPkd1, GFPkd2, GFPkd3) and 3 replicates for the test condition with PRMT5 knockdown (PRMT5kd1, PRMT5kd2, PRMT5kd3). The author aimed to figure out the function of PRMT5 in cancer cells.

I cleaned the data by removing the genes with 0 count for at least 3 samples. We got 15159 genes left. I did not removed any outliers.

I normalized the data using Trimmed Mean of M-values (TMM) normalization method in the edgeR (Robinson, McCarthy, & Smyth, 2010) package.

I then performed the differential gene expression analysis with edgeR where the conditions (PRMT5kd or GFPkd) was the only factor that I used in the design. The Benjamni-Hochberg (BH) correction (Benjamini & Hochberg, 1995) was used.

I performed the thresholded over-representation analysis using g:Profiler (Raudvere et al., 2019). I didn’t get the terms such as “negative regulation of T cells” and “type I INFγ response” which was shown to be enriched for the differentially expressed genes in the PRMT5 knockdown cells (PRMT5kd) (ZHENG et al., 2022). The reason is that the important genes related to these terms, such as Arg2 and CD274, didn’t pass the p-value threshold < 0.05.

Non-thresholded Gene set Enrichment Analysis

Prepare the ranked data

To perform GSEA (Subramanian et al., 2005), I needed to create the ranked set of genes. I got the top hits from edgeR differential gene expression analysis, where in total there are 15159 genes. I calculate the rank with the formula -log10(p-value) * sign(logFC).

library(knitr)

# load the edgeR result from A2
qlf_output_hits <- readRDS("./A3/qlf_output_hits.rds")

# added rank = -log10(p-value) * sign(logFC) into the dataframe
qlf_output_hits[,"rank"] <- -log(qlf_output_hits$PValue, base = 10) * sign(qlf_output_hits$logFC)
# order in decreasing order of rank
qlf_output_hits <- qlf_output_hits[order(qlf_output_hits$rank, decreasing = TRUE),]

# make the ranked list
ranked_genes <- data.frame(GeneName = rownames(qlf_output_hits), rank = qlf_output_hits[,"rank"])
write.table(ranked_genes, "./A3/all_ranked_genes.rnk", quote = FALSE, row.names = FALSE, sep = "\t")
kable(ranked_genes[c(1:5, 15155:15159),], type="pipe")
GeneName rank
1 ADAM19 5.525852
2 DDIT3 5.383518
3 FGG 5.160587
4 OLFML2A 4.704579
5 LIFR 4.479153
15155 HSP90AA1 -4.157425
15156 FRMD8 -4.468215
15157 PRMT5 -6.327536
15158 HYPK -6.346021
15159 IL1R2 -6.667572

GSEA Parameters

I used the baderlab geneset collection (Merico, Isserlin, Stueker, Emili, & Bader, 2010) from March 1, 2021 containing GO biological process, no IEA and pathways (Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt) as our geneset database.

I ran GSEAPreranked with the following parameters (GSEA v4.2.3):

  • Required Fields:
    • Gene set database: baderlab geneset
    • Number of Permutations: 1000 (default)
    • Ranked list: my ranked gene set
    • Collapse/ Remap to gene symbols: No collapse
    • Chip platform: ftp.broadinstitute.org://pub/gsea/annotations_versioned/Human_HGNC_ID_MSigDB.v7.5.1.chip
  • Basic Fields:
    • Max size: 200
    • Min size: 15

GSEA Result

Many top terms for na_pos phenotype (up-regulated genes in PRMT5kd cells) are related to collagen and coagulation and wound healing, which is not seen in thresholded analysis results for up-regulated genes. Whereas in thresholded analysis results, the top terms are related to cell development. Since the paper mentions two terms “negative regulation of T cells” and “type I INFγ response” (ZHENG et al., 2022), I specifically looked for those two terms. In the thresholded analysis results, neither of them appears, while both terms appears in the GSEA results even though they are with a poor p-value. In A2, I found that the genes such as such as Arg2 and CD274, didn’t pass the p-value threshold < 0.05, so those genes are filtered out in thresholded analysis, that’s why the two terms can’t be found there. However, here we didn’t filter out those genes based on their p-value, so we would get the two terms in the results but not significantly.

NEGATIVE REGULATION OF T CELL ACTIVATION%GOBP%GO:0050868

  • ES: 0.47
  • NES: 1.31
  • p-value: 0.122
  • FDR: 0.286

TYPE I INTERFERON SIGNALING PATHWAY%GOBP%GO:0060337

  • 0.45
  • 1.29
  • 0.117
  • 0.305

For na_neg phenotype (down-regulated genes in PRMT5kd cells), I saw many terms related to basic biological processes, as seen in the thresholded analysis results.

The comparsion of the threshould analysis and non-thresholded analysis is not that straight forward to me since they used different annotation data source (except GO:BP). However, from the results I can still see that those non-significant genes play some roles on the analysis.

Visualize the Gene set Enrichment Analysis in Cytoscape

Enrichment Map

I used Enrichment Map App and set FDR q-value cutoff to 0.01 to visulize the GSEA results.

There are 372 nodes and 6303 edges in the resulting map. The node cutoff is 0.01 and edge cutoff is 0.375.

Figure 1: Cytoscape settings for the intial enrichment map

Figure 2: Enrichment Map for my GSEA results

Annotate the network

I used the AutoAnnotate App in Cytoscape to add annotation to my results with the following steps:

  • Select -> Select all
  • Apps -> AutoAnnotate -> New Annotate Set…
  • Check “Layout network to prevent overlap”
  • Use default setting for other parameters

Figure 3: Result after annotation using AutoAnnotate

Publication ready figure

I made the publication ready figure by checking the “Publication-Ready” in EnrichmentMap panel.

Figure 4: Publication ready figure for our network

Figure 5: Legend of the enrichment map

Theme Netowrk

I tried to create a theme network by collapsing all clusters in AutoAnnotate, but I could only get the figure below, without any words.

However, I would still looked at Figure 4 to find some major themes. The major themes includes apc g1 degradation, coupled transport electron, and decay nonsense targeting. I can’t really fit those themes to the model, and I don’t think there are any novel pathway or theme.

Figure 6: Failed theme network plot

Interpretation

  1. Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods
  1. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?

Post Analysis

I added a signature analysis to the network using the nutraceutical drugs from DrugBank. I downloaded the annotation file from baderlab (Merico et al., 2010) from March 1, 2021 collection.

I used the Mann-Whitney (two-sided) test with a cutoff of 0.05.

Below are the terms that overlap:

Figure 7: Overlapped gene sets from nutraceutical drugs

The nutraceutical gene sets are mainly overlapped with the down-regulated gene sets in our dataset. I searched the NADH in DrugBank and NADH always serves as an electron carrier, which makes sense for the large overlap between the nutraceutical gene sets and the “coupled transport electron” which is down-regulated in the PRMT5kd cells.

There are also papers and clinical trials shows the use of nutraceuticals in the treatment of lung cancers.

Figure 8: Enrichment Map with the nutraceutical gene sets

Refernce

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.
Merico, D., Isserlin, R., Stueker, O., Emili, A., & Bader, G. D. (2010). Enrichment map: A network-based method for gene-set enrichment visualization and interpretation. PloS One, 5(11), e13984.
Raudvere, U., Kolberg, L., Kuzmin, I., Arak, T., Adler, P., Peterson, H., & Vilo, J. (2019). G: Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research, 47(W1), W191–W198.
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … others. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
ZHENG, Y., Shen, L., Xiao, H., Hu, R., Zhou, B.others. (2022). PRMT5 inhibition promotes PD-L1 expression and immuno-resistance in lung cancer. Frontiers in Immunology, 5877.